Statistical and causal relations

Douwe Molenaar

Systems Biology Lab

2024-11-01

Introduction

Statistical and causal relations

  1. Causal relations (mechanisms) are at the heart of science
    • All scientific reasoning is causal in nature
    • Causal reasoning is extremely powerful, especially when chained together

Statistical and causal relations

  1. Causal relations are inferred from correlations
    • Preferably obtained in highly controlled laboratory experiments
      • One-factor-at-a-time method (OFAT)
      • Mutating the dam gene affects sensitivity to triclosan in E. coli
    • Often obtained under less controlled conditions
      • Randomized controlled trials (RCT)
      • Ozempic causes weight loss
    • Often obtained from purely observational studies (in the field, in society, …)
      • many things may change at the same time
      • Smoking causes lung cancer
      • Does glyphosate cause Parkinson’s disease?

Can we infer causal (mechanistic) relations from statistical relations?

  • Is it possible to infer causal relations when many things change at the same time?
  • Is it possible to infer causal relations from purely observational experiments?
  • The importance of assumptions/knowledge about the data generating process (the mechanism)
  • The use of causal graphs as an aid in reasoning

Correlation and causation

You may have heard this statement:

Correlation does not imply causation


Here we state that it does:


Correlation implies causation!

But not always in the way that you think it does

More precisely

  1. From an analysis of data we conclude that there is a statistically significant correlation between variables
  2. By this statement we mean that we firmly believe that repeating the experiment will yield the same result
  3. Then, that correlation must have a mechanistic basis
  4. Mechanisms have a causal description
  5. Hence, a causal explanation must be at the basis of an observed correlation, or


Correlation implies causation

Then, what is the controversy about?

  1. Accidental correlations: a reproduction of the experiment would yield a different result
    • There is no mechanism
    • The correlation is irreproducible
    • Unavoidable by-product of statistical tests
  2. The causal relation involves a confounder
    • There is a mechanism
    • But it involves an indirect causal relation

Example of accidental correlation (I think)

More serious example of accidental correlations

  • When the number of predictors \(\gg\) number of samples
    • This is the case in most “omics” experiments

Simulated example

  • Out of 1000 random numerical predictors, 65 are “statistically significantly” correlated with a categorical response variable in a t-test on 50 samples.
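This effect is easy to reproduce. Below is a minimal sketch (the seed, sample sizes, and significance threshold are assumptions for illustration, not the exact setup behind the number 65): purely random predictors tested against a random grouping still yield dozens of "significant" hits.

```r
# Mock multiple-testing demo: random predictors vs. a random two-group response.
set.seed(1)
n_samples    <- 50
n_predictors <- 1000
group <- rep(c("A", "B"), each = n_samples / 2)
X <- matrix(rnorm(n_samples * n_predictors), nrow = n_samples)
# One t-test per predictor against the categorical response
pvals <- apply(X, 2, function(x) t.test(x ~ group)$p.value)
sum(pvals < 0.05)  # on the order of 50 "significant" hits, all false positives
```

At a 5% threshold we expect about 5% of 1000 truly unrelated predictors to pass, which is why correction methods (covered in Biosystems Data Analysis) are essential.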

Fortunately, there are ways to control this error

We will get back to this in January in the course Biosystems Data Analysis

Example of indirect cause (I think)

Stork couples and birth rates of human babies in European countries


The p-value of a correlation test on these data equals 0.0079.

Discussion

It is obvious why many call an accidental correlation a spurious correlation, spurious meaning false or fake.

However, why do they also call the correlation between storks and birth rates a spurious correlation? What aspect is false or fake about this correlation?


Do you agree that both cases are often lumped together under the term spurious correlation?

Why bother about causality?

Causality is baked into our DNA

  • DNA encodes the instructions for a machinery, a highly predictable series of causes and effects
  • Molecular mechanisms are machines that create predictable patterns of cause and effect
  • Evolution has selected machinery that computes beneficial effects on fitness

Causal reasoning is baked into our brain

We are easily tricked into drawing false conclusions. Even trained scientists get caught in this trap …

Gut microbial composition (microbiome) and autism

microbiome \(\longrightarrow\) autism

autism \(\longrightarrow\) dietary preference \(\longrightarrow\) microbiome


The controversy continues …

(note the second author of this paper and the one above)

Simpson’s paradox

Or why we need to train our intuition

A drug is tested on a group of 800 persons. This is the outcome:

When split according to sex

Should we treat or not treat?

Cause and effect: why bother?

Petabytes allow us to say: “Correlation is enough.”

Chris Anderson, “The End of Theory”, WIRED Magazine

  • Correlation is not a mechanism
  • Only a mechanism allows generalization
    • Reason about situations beyond those in which the data were obtained
    • Reason about most plausible causes
    • Reason about likely effect of interventions

Examples:

  • Carbon dioxide production by humans is the cause of climate change
  • Smoking is a cause of lung cancer

Causal graphs

Graphical record of dependencies

  1. Random variables \(X\) and \(Y\) are independent:

  2. Random variables \(X\) and \(Y\) are dependent:

The source of genuine dependency is causation

  • \(X\) and \(Y\) are dependent
  • The causal direction is undecidable from data alone

\(X\) causes \(Y\) (Varying \(X\) leads to variation in \(Y\))

\(Y\) causes \(X\) (Varying \(Y\) leads to variation in \(X\))

Three variables

In case of three random variables we have three possible patterns

  1. Mediator: \(Y \longrightarrow X \longrightarrow Z\)
  2. Confounder (common cause): \(Y \longleftarrow X \longrightarrow Z\)
  3. Collider: \(Y \longrightarrow X \longleftarrow Z\)

What can we say about the joint probability distributions \(P(Y,Z)\)?

  1. \(Y\) and \(Z\) are dependent: \(P(Y,Z) \neq P(Y) \cdot P(Z)\)
  2. \(Y\) and \(Z\) are dependent: \(P(Y,Z) \neq P(Y) \cdot P(Z)\)
  3. \(Y\) and \(Z\) are independent: \(P(Y,Z) = P(Y) \cdot P(Z)\)

What can we say about the conditional joint probability distributions \(P(Y,Z|X)\)?

  1. \(Y\) and \(Z\) are conditionally independent: \(P(Y,Z|X) = P(Y|X) \cdot P(Z|X)\)
  2. \(Y\) and \(Z\) are conditionally independent: \(P(Y,Z|X) = P(Y|X) \cdot P(Z|X)\)
  3. \(Y\) and \(Z\) are conditionally dependent: \(P(Y,Z|X) \neq P(Y|X) \cdot P(Z|X)\)
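These six statements can be checked numerically. Below is a sketch with linear Gaussian mock data (the structural equations are assumptions for illustration), using correlations of residuals after regressing out \(X\) as a stand-in for conditioning on \(X\) (exact for linear Gaussian models):

```r
set.seed(42)
n <- 10000
## 1. Mediator: Y -> X -> Z
y1 <- rnorm(n); x1 <- y1 + rnorm(n); z1 <- x1 + rnorm(n)
cor(y1, z1)                                  # clearly nonzero: dependent
cor(resid(lm(y1 ~ x1)), resid(lm(z1 ~ x1)))  # near zero: independent given X
## 2. Confounder: Y <- X -> Z
x <- rnorm(n); y <- x + rnorm(n); z <- x + rnorm(n)
cor(y, z)                                    # clearly nonzero: dependent
cor(resid(lm(y ~ x)), resid(lm(z ~ x)))      # near zero: independent given X
## 3. Collider: Y -> X <- Z
y2 <- rnorm(n); z2 <- rnorm(n); x2 <- y2 + z2 + rnorm(n)
cor(y2, z2)                                  # near zero: independent
cor(resid(lm(y2 ~ x2)), resid(lm(z2 ~ x2)))  # clearly nonzero: dependent given X
```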

Conditional independence/dependence

We have seen dependence/independence of variables from a joint distribution

Independence, \(X \perp Y\):

\[ P(X,Y) = P(X) \cdot P(Y) \quad \text{or} \;\; P(Y|X) = P(Y) \]

Dependence \(X \not\perp Y\):

\[ P(X,Y) \neq P(X) \cdot P(Y) \quad \text{or} \;\; P(Y|X) \neq P(Y) \]

Now it is time to define conditional independence and conditional dependence


Important: we consider joint distributions of at least three random variables to define these new concepts

Conditional independence/dependence

Conditional probability in the presence of a third variable

\[ P(X,Y\,|\,Z) = \frac{P(X,Y,Z)}{P(Z)} \]

Two variables are independent conditionally on a third variable (conditionally independent) when the following holds for their joint distribution:

\[ P(X,Y \,|\, Z) = P(X \,|\, Z) \cdot P(Y \,|\, Z) \]

Or briefly \(X \perp Y \,|\, Z\)

  • Note that even if \(X\) and \(Y\) are dependent they may still be conditionally independent (on a third variable)

What is the source of conditional independence?

Example from biochemistry: activity of genes

Observed activities

Underlying probability distribution function

Conditional probability distributions

\[ P(Y\,|\,X) = \frac{P(X,Y)}{P(X)} \]


Example: \(P(Y\,|\,X=450)\)

Example from biochemistry

X and Y as functions of a third gene Z

Proposed causal diagram

Continued

According to the scheme

  • Dependence between \(X\) and \(Y\) is due to variation in \(Z\)
  • Keeping \(Z\) at a constant value (conditioning on \(Z\)) should remove dependency
  • \(X\) and \(Y\) should be independent conditioned on \(Z\)

\[ P(X,Y \,|\, Z) = P(X \,|\, Z) \cdot P(Y \,|\, Z) \]




Examples: at \(Z=9\) and \(Z=34\)

Discussion

Is the proposed causal scheme (\(Z\) as a confounder or common cause) the only causal scheme explaining conditional independence?

No. We will also obtain conditional independence of \(X\) and \(Y\) if \(Z\) is a mediator:

  • \(X \longrightarrow Z \longrightarrow Y\)
  • \(X \longleftarrow Z \longleftarrow Y\)

Data alone do not provide the causal direction

Causal diagrams are Directed Acyclic Graphs (DAGs)

  • Graphs consist of nodes and edges
  • Nodes represent Variables
  • Edges represent Causal relations
  • Edges are directed, pointing from cause to effect
  • An effect cannot be its own direct or indirect cause: the diagram is acyclic

A DAG

Not a DAG

Causal diagrams are very simple mechanistic descriptions

  • We just assert that “A change in (the distribution of) \(X\) causes a change in (the distribution of) \(Y\)”
  • We don’t describe the sign of the effect
  • We don’t describe the size of the effect
  • We don’t describe in which time frame the effect takes place

Working out a mock example

The success rate of bird nests

The number of fledglings depends on the season

\[ \begin{align} \text{Season} &\longrightarrow \text{Fledglings}\quad \Rightarrow \\ \text{Season} &\not\perp \text{Fledglings} \end{align} \]

In fact, food is an intermediate cause

\[ \begin{align} \text{Season} & \longrightarrow \text{Food} \longrightarrow \text{Fledglings} \quad \Rightarrow \\ \text{Season} & \not\perp \text{Fledglings} \end{align} \]

Then, conditioning on Food should remove dependence

\[ \text{Season} \perp \text{Fledglings}\;|\;\text{Food} \]

In words:

  • Conditioning on the mediator removes dependency between cause and effect
  • Conditioning on the mediator “blocks the path” between cause and effect
  • Conditioning on the mediator “blocks information exchange between cause and effect”

Demonstration on a mock dataset

Expand to show code that generates data
# Generating mock data
library(tibble)  # tibble()
library(dplyr)   # %>% and mutate()
set.seed(123)
n <- 20
seasons <- c('Spring','Summer','Autumn')
N <- n*length(seasons)
food <- c(8,2.4,0.2)
names(food) <- seasons
d1 <- tibble(
  Season = factor(rep(seasons, n), levels=seasons)
)
d1 <- d1 %>%
  mutate(Food = food[Season] * exp(rnorm(N,sd=1))) %>%  # Season -> Food
  mutate(Young = Food*exp(rnorm(N, sd=1))) %>%          # Food -> Fledglings
  mutate(lgFood = log(Food), lgYoung = log(Young))

Effect of the season on fledglings

Modeling dependence of Food and Fledglings

Conditioning on Food, first approach

Residuals of Fledglings regressed on Food


  • Conditioning is carried out by regressing on Food
  • Residual variation in Fledglings is independent of Season

ANOVA on model Residuals ~ Season

term df sumsq meansq statistic p.value
Season 2 0.871 0.436 0.591 0.557
Residuals 57 42.023 0.737 NA NA
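The first approach can be reproduced end-to-end in base R. The sketch below is a self-contained reconstruction that regenerates the same mock data as the tibble chunk above (same seed and draw order, so the ANOVA should again find no Season effect on the residuals):

```r
set.seed(123)
n <- 20
seasons <- c("Spring", "Summer", "Autumn")
N <- n * length(seasons)
food_means <- c(Spring = 8, Summer = 2.4, Autumn = 0.2)
Season <- factor(rep(seasons, n), levels = seasons)
Food  <- food_means[Season] * exp(rnorm(N, sd = 1))  # Season -> Food
Young <- Food * exp(rnorm(N, sd = 1))                # Food -> Fledglings
lgFood  <- log(Food)
lgYoung <- log(Young)
# Condition on Food by regressing it out, then test Season on the residuals
res <- resid(lm(lgYoung ~ lgFood))
anova(lm(res ~ Season))  # Season should not be significant here
```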

Conditioning on Food, second approach

Comparing linear models

  • Create models lgYoung ~ lgFood and lgYoung ~ lgFood + Season
  • Compare using ANOVA
Expand to show code
library(kableExtra)  # kbl() and kable_classic()
m0 <- lm(lgYoung ~ 1, data = d1)
m1 <- lm(lgYoung ~ lgFood, data = d1)
m2 <- lm(lgYoung ~ lgFood + Season, data = d1)
anova(m0,m1,m2) |>
  broom::tidy() |>
  kbl(digits=3) |>
  kable_classic(full_width=FALSE, font_size=18, position="left")
term df.residual rss df sumsq statistic p.value
lgYoung ~ 1 59 278.298 NA NA NA NA
lgYoung ~ lgFood 58 42.894 1 235.404 313.901 0.000
lgYoung ~ lgFood + Season 56 41.996 2 0.898 0.598 0.553

Conclusion

Since adding Season as a predictor does not significantly improve the model fit compared to using Food alone, the causal diagram is in agreement with the data.

Same data with additional seasonal dependency

Adding a seasonally dependent effect of predation





Conditioning on Food:

  • Removes correlation mediated by Food
  • Blocks path through Food from Season to Fledglings
  • Leaves open the path from Season to Fledglings through Predation
Expand to show code that generates data
# Generating mock data
set.seed(1234)
predation <- c(0.2,4,6)
names(predation) <- seasons
d2 <- tibble(
  Season = factor(rep(seasons, n), levels=seasons)
)
d2 <- d2 %>%
  mutate(Food = food[Season] * exp(rnorm(N, sd=1))) %>%
  mutate(Predation = predation[Season] * exp(rnorm(N, sd=1))) %>%
  mutate(Young = (Food/Predation)*exp(rnorm(N, sd=1))) %>%
  mutate(lgFood = log(Food), lgYoung = log(Young), lgPredation = log(Predation))

Conditioning on Food

Effect of the season on fledglings

Modeling dependence of Food and Fledglings

Residual variation in Fledglings still depends on Season

ANOVA on model Residuals ~ Season

term df sumsq meansq statistic p.value
Season 2 83.976 41.988 15.571 0
Residuals 57 153.702 2.697 NA NA

Second approach

  • Create models lgYoung ~ lgFood and lgYoung ~ lgFood + Season
  • Compare using ANOVA
Expand to show code
m0 <- lm(lgYoung ~ 1, data = d2)
m1 <- lm(lgYoung ~ lgFood, data = d2)
m2 <- lm(lgYoung ~ lgFood + Season, data = d2)
anova(m0,m1,m2) |>
  broom::tidy() |>
  kbl(digits=3) |>
  kable_classic(full_width=FALSE, font_size=18, position="left")
term df.residual rss df sumsq statistic p.value
lgYoung ~ 1 59 636.595 NA NA NA NA
lgYoung ~ lgFood 58 237.678 1 398.917 189.021 0
lgYoung ~ lgFood + Season 56 118.184 2 119.494 28.310 0

Conclusion

Since adding Season as a predictor significantly improves the model fit compared to using Food alone, the causal diagram does not agree with the data. In particular, there may be a second route from Season to Fledglings.

Discussion

  • Propose a model that removes the dependency of Fledglings on Season completely
  • What is the biological interpretation of its coefficients?

Solving Simpson’s paradox

Back to the data

Results of a medical experiment

Treatment Not recovered Recovered Total Recovery rate
Not treated 24 16 40 40%
Treated 20 20 40 50%

When splitting the result according to sex

Sex Treatment Not recovered Recovered Total Recovery rate
Female Not treated 21 9 30 30%
Female Treated 8 2 10 20%
Male Not treated 3 7 10 70%
Male Treated 12 18 30 60%
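The reversal can be verified directly from the counts in the two tables: the treatment looks beneficial in aggregate, yet harmful within each sex separately.

```r
# Recovery rates computed from the tables above (counts taken verbatim)
rate <- function(recovered, total) recovered / total
rate(20, 40)  # treated, overall:   0.5
rate(16, 40)  # untreated, overall: 0.4  -> treatment looks beneficial
rate(2, 10)   # treated females:    0.2
rate(9, 30)   # untreated females:  0.3  -> harmful within females
rate(18, 30)  # treated males:      0.6
rate(7, 10)   # untreated males:    0.7  -> harmful within males
```

The aggregate comparison is misleading because treatment assignment is unbalanced across the sexes (most treated subjects are male, and males recover more often regardless of treatment).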

Analysis of the experiment


Sex Treatment Not recovered Recovered Total Recovery rate
Female Not treated 21 9 30 30%
Female Treated 8 2 10 20%
Male Not treated 3 7 10 70%
Male Treated 12 18 30 60%

  • Conditioning on sex in this case means: investigating the effect in the sexes separately.
  • By conditioning on Sex we “block a backdoor” and can isolate the effect of Treatment on Recovery.

A backdoor is an unblocked path (a path without a collider) between the potential cause and the potential effect that starts with an arrow pointing into the cause. A confounder opens such a path.

Using ANOVA to solve this problem

  • Here we suppose that the response \(Y\) is a continuous variable, not the categorical “recovered”, “not recovered”
  • The data can be found on the server in the file simpsons.tab

Testing model y ~ treatment

Expand for code
# simpsons <- readr::read_tsv(f, comment="#")
model1 <- lm(y ~ treatment, data=simpsons)
anova(model1)
Analysis of Variance Table

Response: y
           Df  Sum Sq Mean Sq F value Pr(>F)
treatment   1   248.6  248.65  1.7854  0.183
Residuals 198 27574.8  139.27               


How to condition on Sex?

Conditioning on sex

Graphically as well as using linear modeling and ANOVA

Compare

  • model1: y ~ treatment
  • model2: y ~ treatment + sex

ANOVA comparing model1 and model2

term df.residual rss df sumsq statistic p.value
y ~ treatment 198 27574.8 NA NA NA NA
y ~ treatment + sex 197 18347.1 1 9227.682 NA 0


Summary of model2

term estimate std.error statistic p.value
(Intercept) 85.7 1.0 82.3 0
treatmentTRUE -10.1 1.6 -6.4 0
sexM 15.7 1.6 10.0 0

What else should we investigate in these data?

Fundamental question

How do we know that the causal relation is not like this? Does the data tell us anything about this?

There is no way of knowing this just from the data. We need to have additional information about the experiment and/or prior knowledge about the system.

Which experiment should we have performed?

  • How can you avoid Sex bias?
  • How can you avoid bias on any other (known/unknown) property of the subjects?

The randomized controlled trial (RCT)

  • There is no known mechanism by which Sex could influence (be a cause of) the dice roll that assigns treatment.
  • There is no known mechanism by which any other potential confounder could influence the dice either.
  • If we see a significant effect of treatment on recovery, then treatment MUST be the cause.
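A sketch of why randomization works (all effect sizes below are made-up numbers): when treatment is assigned by coin flip it is decoupled from Sex, so the simple regression of outcome on treatment recovers the true effect even though Sex also influences the outcome.

```r
set.seed(7)
n <- 2000
sex     <- rbinom(n, 1, 0.5)           # potential confounder
treated <- rbinom(n, 1, 0.5)           # assigned at random, NOT based on sex
y <- 2 * treated + 3 * sex + rnorm(n)  # assumed true treatment effect = 2
cor(sex, treated)                      # near zero: backdoor closed by design
coef(lm(y ~ treated))["treated"]       # close to 2 even without adjusting for sex
```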

The role of confounding in the lab

The previous experiment is partly uncontrolled:

  • people have different genetic make-ups
  • people eat different things and, in general, behave differently
  • people are in different physiological states

How about the highly controlled laboratory environment?

  • Controls are investigated by Mr. A and mutants are investigated by Ms. B.
  • Controls are pipetted in the first column of the multiwell plate and all mutants in the remaining columns.
  • Controls were performed in the first half year and treatments in the second half year of the project.
  • Controls are done in the morning and mutants in the afternoon.

Mendelian randomization

Causal relations from purely observational data

Or: using an instrumental variable (see syllabus)

You want to condition on a confounder

Example: microbiome

  • Why would you want to know the causal relation represented by the red or blue arrow?
  • How would you adjust for a putative cholesterol effect?

Colliders

A collider is an effect with two or more causes

  • The graph does not specify the type of dependency
  • Causes could be sufficient
  • Causes could be necessary but not sufficient

Statistical relations

\[ \begin{align} \text{Species X} &\perp \text{Disease} \\ \text{Species X} &\not \perp \text{Disease} \;|\; \text{Cholesterol} \end{align} \]

Example

  • Sufficient causes: cholesterol is high when Disease is present or when Species X is present

Within “high cholesterol” subpopulation:

Conditioning on a collider induces dependency!
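A mock simulation of the cholesterol example (the 30% prevalences and the deterministic sufficient-cause rule are assumptions for illustration): Species X and Disease are independent overall, but within the high-cholesterol subpopulation they become clearly negatively associated.

```r
set.seed(5)
n <- 100000
species <- rbinom(n, 1, 0.3)              # Species X present?
disease <- rbinom(n, 1, 0.3)              # Disease present? (independent of species)
chol_high <- species == 1 | disease == 1  # collider: either cause is sufficient
cor(species, disease)                       # near zero: independent
cor(species[chol_high], disease[chol_high]) # clearly negative: dependent given collider
```

Intuition: within the high-cholesterol group, observing that a subject lacks Species X makes Disease the more likely explanation for the high cholesterol, and vice versa.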




Microbiome researchers in trouble

They don’t know the right causal scheme

  • In case A they must control for (condition on) Cholesterol to find the direct effect
  • In case B they must not control for Cholesterol to find the direct effect

Conclusion: they are likely to draw wrong conclusions when “scanning” a microbiome for effects

Simplification of correlation networks

Data from the American Gut Microbiome

The full correlation network

The simplified correlation network

Questions

Debunking the stork story

Show, using linear models and ANOVA, that the causal scheme shown in slide 1.9 is in accordance with the data. In particular:

  1. Show that Surface is a good predictor of Birth_rate.
  2. Show that Storks provides no additional information about Birth_rate beyond what Surface alone provides.
  3. Investigate whether Humans (population size) is a good predictor of Birth_rate.
  4. Investigate whether Surface provides additional information about Birth_rate beyond what Humans alone provides.
  5. Argue why we can conclude from these observations that the causal model is at least in part in agreement with the data.
  6. Do the data (only the data!) rule out the possibility that Storks are a cause of Birth_rate?

The data are available here: https://few.vu.nl/~molenaar/courses/data/causalinference/storks_and_birth_rate.csv

Answers

Debunking the Stork story

Expand for code
library(readr)  # read_csv()
d <- read_csv('https://few.vu.nl/~molenaar/courses/data/causalinference/storks_and_birth_rate.csv', comment="#")
lm0 <- lm(Birth_rate ~ 1, data=d)
lm1 <- lm(Birth_rate ~ Surface, data=d)
lm2 <- lm(Birth_rate ~ Surface + Storks, data=d)
lm3 <- lm(Birth_rate ~ Humans, data=d)
lm4 <- lm(Birth_rate ~ Humans + Surface, data=d)
lm5 <- lm(Birth_rate ~ Humans + Surface + Storks, data=d)
lm6 <- lm(Birth_rate ~ Storks, data=d)
a1 <- anova(lm0,lm1,lm2) 
a2 <- anova(lm0,lm3,lm4,lm5) # Surface adds information on top of Humans
a3 <- anova(lm0,lm6,lm2) # Data alone do not rule out an effect of Storks on Birth_rate
a1 |>
  broom::tidy() |>
  flextable::flextable() |>
  flextable::autofit()

term df.residual rss df sumsq statistic p.value
Birth_rate ~ 1 16 2,690,208.0 NA NA NA NA
Birth_rate ~ Surface 15 400,603.0 1 2,289,604.0 86.0 0.0
Birth_rate ~ Surface + Storks 14 370,796.0 1 29,807.0 1.0 0.3

  • Surface is a fair predictor of Birth_rate
  • Conditioning on Surface means including it as a predictor in the statistical model
  • When we condition on Surface we see from the ANOVA that Storks provides no additional predictive value
  • Storks tell no more about the birth rate in a country than its surface area already does
  • In addition to an indirect effect of Surface through Humans, there could be another, direct effect of Surface on Birth_rate.
  • The data are at least partly in accordance with the causal model (to the right)
  • The data alone cannot rule out that Storks bring babies -> data alone do not provide a mechanism

Literature

More literature

A popular account of causal inference:

References

Hsiao, Elaine Y., Sara W. McBride, Sophia Hsien, Gil Sharon, Embriette R. Hyde, Tyler McCue, Julian A. Codelli, et al. 2013. “Microbiota Modulate Behavioral and Physiological Abnormalities Associated with Neurodevelopmental Disorders.” Cell 155 (7): 1451–63. https://doi.org/10.1016/j.cell.2013.11.024.
Kurtz, Zachary D., Christian L. Müller, Emily R. Miraldi, Dan R. Littman, Martin J. Blaser, and Richard A. Bonneau. 2015. “Sparse and Compositionally Robust Inference of Microbial Ecological Networks.” Edited by Christian Von Mering. PLOS Computational Biology 11 (5): e1004226. https://doi.org/10.1371/journal.pcbi.1004226.
Matthews, Robert. 2000. “Storks Deliver Babies (p= 0.008).” Teaching Statistics 22 (2): 36–38. https://doi.org/10.1111/1467-9639.00013.
Özcan, Ezgi, and Elaine Y. Hsiao. 2022. “Are Changes in the Gut Microbiome a Contributor or Consequence of Autism—Why Not Both?” Cell Reports Medicine 3 (1): 100505. https://doi.org/10.1016/j.xcrm.2021.100505.
Pearl, Judea, and Dana Mackenzie. 2019. The Book of Why. Penguin Books Ltd (UK). https://www.penguin.co.uk/books/289/289825/the-book-of-why/9780141982410.html.
Yap, Chloe X., Anjali K. Henders, Gail A. Alvares, David L. A. Wood, Lutz Krause, Gene W. Tyson, Restuadi Restuadi, et al. 2021. “Autism-Related Dietary Preferences Mediate Autism-Gut Microbiome Associations.” Cell 184 (24): 5916–5931.e17. https://doi.org/10.1016/j.cell.2021.10.015.